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Motivated by genome- wide association studies we consider a stan- 
dard linear model with one additional random effect in situations 
where many predictors have been collected on the same subjects and 
each predictor is analyzed separately. Three novel contributions are 
(1) a transformation between the linear and log-odds scales which 
is accurate for the important genetic case of small effect sizes; (2) 
a likelihood-maximization algorithm that is an order of magnitude 
faster than the previously published approaches; and (3) efficient 

■ methods for computing marginal likelihoods which allow Bayesian 
' model comparison. The methodology has been successfully applied 

to a large-scale association study of multiple sclerosis including over 

■ 20,000 individuals and 500,000 genetic variants. 

-I— > 

1. Introduction. We describe computationally efficient methods to an- 
^ alyze one of the simplest linear mixed models: 

00 i (1.1) Y = X(3 + q + s, 

OO; 

"xl" \ where Y = (yi, . . . , y n Y is the vector of responses on n subjects, X = (xik) 

f*"^ 1 is the n x K matrix of predictor values on the subjects, j3 = . . . , (3k) T 

collects the (unknown) linear effects of the predictors on the responses Y 
and the random effects q and s are assigned the distributions 



(1.2) Q\(i],o- 2 ) ~ M(0,w R) and e\(7],a 2 ) ~ M(0, (1 - r])a 2 I 



5_j ■ Here R is a known positive semi-definite n x n matrix, i" is the nxn identity 

matrix and parameters a 2 > and r] € [0, 1] determine how the variance is 
divided between g and e. 

Originally this model arose to explain how the genetic component of a 
quantitative trait, such as height, is correlated between relatives (Fisher, 
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1918). Many extensions of the model have been thoroughly studied in genet- 
ics to estimate heritabilities of traits, breeding values of individuals and loca- 
tions of quantitative trait loci (see e.g. Lynch and Walsh (1998); Sorensen and Gianola 
(2002)). 

Recently, the model has been applied to genome-wide association stud- 
ies (GWAS) (Astle and Balding, 2009; Kang et al., 2008, 2010; Yu et al., 
2005; Zhang et al., 2009, 2010). GWAS measure genotypes at a large num- 
ber (500,000 - 1,000,000) of single-nucleotide polymorphisms (SNPs) in large 
samples of individuals, with the goal of identifying genetic variants that ex- 
plain variation in a phenotype (McCarthy et al., 2008). Typically GWAS 
data are analyzed by testing each SNP separately using standard linear 
or logistic regression models. However, these models become invalid if the 
ascertainment procedure itself introduces correlations between the pheno- 
type and the genetic background of the individuals. (See Astle and Balding 
(2009) for a detailed description of spurious associations in GWAS.) The 
linear mixed model (1.1) can reduce the confounding effects by using the 
covariance matrix R of the random effect g to model the genome-wide re- 
latedness between the samples. To emphasize the structure of the GWAS 
application we write the model as 

(1.3) Y = Cp c + X®i3 e + Q + e, 

where X^ contains the genetic data at the SNP I and the matrix C contains 
the non- genetic covariates, such as age and sex. The most common strategy 
is to set X^' equal to the number of copies of the minor allele at the SNP £, 
but also dominant, recessive or more complex genetic effects can be modeled 
in this framework. Even when the model needs to be analyzed for millions 
of different X^ matrices, one for each SNP, efficient computation becomes 
possible since the matrix R remains constant for a large number of the 
SNPs. 

Our work with this model is motivated by a large GWAS on multiple scle- 
rosis (20,119 individuals, 520,000 SNPs) which we explain in detail in Section 
2. This case-control data set required novel methodological and computa- 
tional contributions which, together with their applications in other genetics 
problems, are explained in the remaining sections of this paper. 

Section 3 gives a justification for applying the linear mixed model to 
binary data and introduces a way to transform the effect size estimates 
from the linear to log-odds scale. Such a transformation is crucial for a 
meaningful interpretation of the effect sizes and for combining the results 
with other separately analyzed data sets, for example, in a replication phase 
of GWAS or in a meta-analysis of several independent studies. 
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The large size of the typical GWAS puts a premium on computational 
efficiency. Section 4 describes a novel algorithm for likelihood analysis that 
reduces the computation time from hundreds of years, as would be required 
by the existing EMMA algorithm (Kang et al., 2008), to only a few days 
and is almost as fast as previous approximations to the model (Kang et al., 
2010; Zhang et al., 2010). With our implementation it is computationally 
feasible to determine when the full model is noticeably more powerful than 
the existing approximations as we demonstrate in Section 4. 

Bayesian approaches provide a natural way to utilize prior knowledge on 
the genetic architecture of common diseases (Stephens and Balding, 2009). 
In Section 5 we compute Bayes factors using the linear mixed model. The 
first application is in evaluating the genetic associations in the multiple 
sclerosis data set. The second application investigates when a non-zero her- 
itability can be convincingly detected in a large and only distantly related 
population sample of individuals. 

In the GWAS setting the challenge of combining data across genetically 
heterogeneous collections with strongly differing case-control ratios will be- 
come more routine as study sizes increase. We therefore hope that our results 
will be important in human genetics, and potentially also in other fields of 
science, where large amounts of heterogeneous data need to be analyzed 
efficiently. 

We have implemented the methodology in a software package called MMM 
which is freely available under the GNU General Public License. 

2. Motivating data set: Multiple sclerosis. Multiple sclerosis (MS) 
is a disease of the central nervous system that can manifest itself through 
a variety of neurological symptoms including, for example, motor problems, 
changes in sensation and chronic pain. The largest individual genetic effect 
is associated with a region of the major histocompatibility complex on chro- 
mosome 6, and about 20 additional risk loci for MS had been identified by 
the beginning of 2011. 

Recently we were involved in a large GWAS of MS (IMSGC and WTCCC2, 
2011). The study was divided into the UK component (1,854 cases and 5,175 
controls) and the non-UK component (7,918 cases and 12,201 controls) which 
were analyzed separately and combined via a fixed-effects meta-analysis. 
About 100 of the most promising signals among the 470,000 SNPs passing 
the quality control criteria were interrogated in an independent replication 
data set of 4,218 cases and 7,296 controls. 

A methodologically challenging part of the study was the non-UK compo- 
nent with 20,119 individuals of European ancestry collected from 14 different 
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Country 


Cases 


Controls 


Country 


Cases 


Controls 


Finland 


581 


2,165 


Australia 


647 




Sweden 


685 


1,928 


New Zealand 


146 




Norway 


953 


121 


Ireland 


61 




Denmark 


332 




USA 


1,382 


5,370 


Germany 


1,100 


1,699 


France 


479 


347 


Poland 


58 




Spain 


205 




Belgium 


544 




Italy 


745 


571 



Table 1 

The origins of the samples in the non- UK component of the MS study. 



countries. Table 1 shows that the case-control ratio varied strongly between 
the countries, with some collections consisting only of case samples. As a re- 
sult, standard meta-analysis approaches, where the samples from each coun- 
try are analyzed separately and the summary statistics combined, turned out 
to be inefficient. 

Alternative approaches, which jointly analyze data from several countries, 
are likely to suffer from confounding effects of population structure. Figure 
1 shows a small part of a genome- wide correlation matrix of the non-UK in- 
dividuals calculated from about 200,000 SNPs. Block-like structures on the 
diagonal show, unsurprisingly, that the similarity of the genomes correlates 
with the sampling locations. Since the case-control status also has a strong 
dependence on the sampling locations due to the ascertainment process (Ta- 
ble 1), spurious associations between SNPs and the phenotype will arise if 
the correlation structure in the data is not properly modeled. 

We explored several approaches to address this issue. First we conducted 
a meta-analysis on groups that had balanced case-control ratios and were 
genetically homogeneous, according to a model-based clustering algorithm. 
We also conducted logistic regression by including the seven leading principal 
components (PCs) of the population structure as covariates (Patterson, Price and Reich, 
2006). A standard way of checking GWAS analysis is based on the assump- 
tion that only a very small proportion of the variants affect the phenotype 
and therefore the test statistics of the majority of the variants should follow 
the null distribution (Devlin, Roeder and Wasserman, 2001). This assump- 
tion is often assessed through the "genomic control" parameter, A, defined as 
the ratio of the median of the observed test statistic distribution to that of 
the theoretical null distribution. A substantial inflation was observed with 
A = 1.44 for the clustering approach and A = 1.22 for the PC approach 
(Figure 1). Although some of the inflation was likely to reflect the poly- 
genic architecture of the disease (small genetic effects at very many vari- 
ants) (Yang et al., 2011), it remained likely that the underlying population 
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Fig 1. Left panel. An 80 x 80 submatrix of the genetic correlation matrix of the non- 
UK individuals in the MS study. Ten randomly chosen individuals are shown for each 
of the following countries: Finland, SWeden, GErmany, IReland, POland, FRance, ITaly 
and SPain. The colors correspond to the pair wise correlation coefficients according to the 
scale in the middle. The diagonal values are close to 1.0 and are colored white. 
Right panel. The association test statistics of 470,000 SNPs plotted from the 1st per- 
centile to the 50th (median). The null distribution on the X-axis is the chi-square with 1 
df. Methods are the linear mixed model (MM) and the logistic regression with 7 leading 
principal components of the population structure as covariates (IPCs). The line is y=x. 



structure was confounding the tests. 

The linear mixed model as presented in this paper provided a way to in- 
clude the whole estimated genetic correlation structure of 20,119 individuals 
in the regression model. The model-checking confirmed that the confound- 
ing effects were well controlled (A = 1.02, see Figure 1) while simultaneously 
the method maintained power to detect associations, as evidenced through 
the replication of over 20 previously-known associations. The main results 
of the MS GWAS, analyzed via the linear mixed model, included the identi- 
fication of 29 novel association signals. These signals had important biolog- 
ical consequences, with further analyses showing that immunological genes 
are significantly overrepresented near the identified loci. In particular, the 
findings highlight an important role for T-helper-cell differentiation in the 
pathogenesis of MS. Another striking pattern was the very substantial over- 
lap between genetic variants associated with MS and those associated with 
autoimmune diseases (see IMSGC and WTCCC2 (2011) for further details). 

3. Binary data. The linear mixed model (1.1) is formulated for a uni- 
variate quantitative response and therefore its application to binary case- 
control data requires further justification. A connection between the stan- 
dard linear model and the Armitage trend test (Armitage, 1955) that we de- 
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rive in the supplementary text (Pirinen, Donnelly and Spencer, 2012) adds 
to the work of Astle and Balding (2009) and Kang et al. (2010) who have 
previously used the mixed model for significance testing in case-control 
GWAS. In addition to testing it is also important to measure the effect 
sizes on a relevant scale. Next we explain how the output from the standard 
linear model can be turned into accurate effect size estimates on the log-odds 
scale, which is a natural scale for case-control studies. 

For 0-1 valued responses Y = (yi, . . . ,y n ) T a logistic regression model 
assumes that 

(3.1) B = j^ = 1 | X[7) "rf*7> , 

1 + exp(JTj7) 

where the row i of X is denoted by and the effects of the predictors are 
in the vector 7. The score function of the corresponding binomial likelihood 
for a set of independent observations is X T (Y — p) where p = (pi, . . . ,p n ) T 
is a function of 7. If we can justify a linear approximation p ~ Xf3, 
then the score becomes approximately zero at the least squares estimate 

= (X T X)~ x X T Y . In the supplementary text we argue that such an 
approximation is good when the logistic model effects 7 are small and we 
provide a connection between the parameters 7 and (3 in those cases. These 
steps allow us to use the output from the standard linear model (i.e. the 
least squares solution (3) to approximate the maximum likelihood estimates 
of the logistic regression model. For our GWAS application, where the case- 
control status is regressed on the population mean and the (mean-centered) 
reference allele count at a SNP, these considerations lead to the following 
estimate of the genetic effect on the log-odds scale: 

1 («i - *> + 0.5(1 -mi- »)? - oM+ow-vw-V py 1 . 

where cf> is the proportion of the cases in the data, 9 is the reference allele 
frequency in the data and /3 is the least squares estimate of the effect of the 
(mean-centered) reference allele count on the binary case-control status. 

To investigate how well this approximation works in typical GWAS set- 
tings we simulated case-control data for 5,000 unrelated individuals at 500 
SNPs for nine case proportions <p € {0.1, 0.2, . . . , 0.9}. The allelic log-odds 
ratios 7 were taken from an equally spaced grid on the interval correspond- 
ing to odds ratios in [1.0,1.3]. This range covers typical GWAS hits; for 
example, in our MS study the median effect size among the 52 reported 
associations was 1.11 (minimum 1.08, maximum 1.22). In our MS study the 
lowest minor allele frequency among the variants taken to replication was 



LINEAR MIXED MODELS WITH LARGE DATA SETS 



7 




Fig 2. Difference between the linear and the logistic model. The panels include results from 
2,500 (top-row) and 2,000 (bottom row) binary variants simulated as described in the text. 
The titles on the leftmost panels show the proportion of cases, (f>, y-axes show the relative 
differences between the linear and the logistic models in percentages and x-axes show the 
results from the logistic model. logOR, log-odds ratio; SE, standard error; -loglO(p), -loglO 
of the p-value from the likelihood-ratio test. 



4.6% which motivated us to sample the risk allele frequencies for the con- 
trols from a Beta(2,2) distribution, truncated to the interval (0.05, . . . , 0.95). 
The frequencies in cases were determined by assuming that each copy of the 
risk allele increases log-odds of the disease additively by 7. Both linear and 
logistic regression models were then applied to the data with the population 
mean and the sampled genotypes as predictors. The differences in log-odds 
estimates 7 and their standard errors together with the p-values from the 
likelihood-ratio tests are shown in Figure 2, where the parameter estimates 
from the linear model have been transformed according to formula (3.2). 

The conclusion from Figure 2 is that in a typical case-control GWAS data 
set where genetic effects are small, the case-control ratio is well-balanced 
and allele frequencies are not extreme (say, OR < 1.3, 0.30 < < 0.70 
and 0.05 < freq < 0.95), the standard linear model provides an accurate 
approximation of the corresponding logistic regression model. The relative 
errors in the log-odds estimates or their standard errors are at most around 
1% and in the -loglO p-values at most around 4% (top row of Figure 2). This 
result is useful because it suggests a natural way to apply the linear mixed 
model to binary data by using generalized least squares estimates (details 
in the supplementary text). The following empirical results show that this 
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|MM - replication| |MM - replication! 



Fig 3. Absolute differences of 93 effect sizes between multiple sclerosis non-UK discovery 
and replication studies. Scales are log-odds (Left panel) and standardized log-odds (Right 
panel). MM: the linear mixed model in the discovery data; IPCs: logistic regression with 7 
principal components of genetic structure as covariates in the discovery data; replication: 
the replication data analyzed with logistic regression. Points above the diagonal: 62/93 
(Left) and 59/93 (Right). 



procedure performed well in our application. 

In our multiple sclerosis study we took 93 independent SNPs to the repli- 
cation phase. The replication analysis was conducted with 4,218 cases and 
7,296 controls using logistic regression (for details see IMSGC and WTCCC2 
(2011)). Figure 3 shows the absolute difference between the effect sizes in 
replication analysis and in the non-UK part of the discovery analysis using 
the linear mixed model (x-axes) and logistic regression including 7 principal 
components as covariates (y-axes). The log-odds ratios estimated by the lin- 
ear mixed model were closer to the replication results in 62 out of 93 SNPs 
(one-sided binomial p- value 0.0009). The same pattern was present when the 
absolute differences are standardized (59 out of 93, p=0.006). This suggests 
that the methods presented here for estimating the log-odds ratios by the 
linear mixed model can lead to more accurate estimates than standard logis- 
tic regression analyses when the data contain complex correlation structure 
which, for practical reasons, cannot be fully included in a logistic regression 
model. 

Obtaining effect size estimates and their standard errors is critical in the 
genetics context both in interpreting the results of individual studies, and 
in combining results, via meta-analysis, across studies. 
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4. Maximum likelihood computation. The main analysis of our 
multiple sclerosis study was based on maximum likelihood (ML). In this 
section we consider how to efficiently maximize the likelihood function cor- 
responding to the sampling distribution 

(4.1) Y\(f3, a 2 , i]) ~ M(X(3, V a 2 R + (1 - 7])a 2 I), 

with respect to /3,rj and a 2 . 

In general, finding the ML estimates for linear mixed models requires iter- 
ative procedures with expensive matrix operations (p. 787 Lynch and Walsh 
(1998)), but for the particular model (4.1) more efficient algorithms can be 
found. To our knowledge the most efficient published algorithm is EMMA 
(Kang et al., 2008) which has been applied to several recent GWAS (Atwell et al., 
2010; Boyko et al., 2010). The algorithm FMM by Astle (2009), which is cur- 
rently being implemented in the software suite GenABEL (Aulchenko et al., 
2007), was faster than EMMA in our test cases but to date its exact compu- 
tational details have not been published. Another implementation of EMMA 
is in the software package TASSEL (currently v. 3.0) (Bradbury et al., 2007) 
which provides a graphical interface and several approximations to reduce 
the running time. 

Next we describe a novel conditional maximization algorithm which is an 
order of magnitude faster than EMMA and was also faster than FMM in our 
tests except with the smallest sample size of n = 250 individuals. We also 
consider in which situations the full ML estimation is more powerful than 
a recently proposed generalized least squares approximation (Kang et al., 
2010; Zhang et al., 2010), and compare the available methods. Finally we 
give running times on our MS data set. 

4.1. Conditional maximization. Our contribution to the ML estimation 
under the model (4.1) is to notice that by transforming the data and pre- 
dictors in such a way that the covariance matrix becomes diagonal we are 
able to implement an efficient conditional maximization procedure. This is a 
direct extension of the transformation that is used in general linear models 
to handle non-diagonal covariance matrices to a more general case of two 
variance components. 

The eigenvalue decomposition of the positive semi-definite matrix R yields 
an orthonormal n x n-matrix U of eigenvectors and a diagonal n x n-matrix D 
of non-negative eigenvalues for which R = UDU T (see Golub and Van Loan 
(1996)). Let us write Y = U T Y, X = U T X, and £ = rjD + (1 - rj)I. Then 
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the log-likelihood function is 
(4.2) 

L(/3, n,a 2 ) = c-^ log(a 2 ) - \ Iog(|E|) - ^ ( Y ~ X(3) t ^\y - Xf3), 

where c = — § log(27r) and |S| denotes the determinant of XI (details in the 
supplementary text). 

Note that U, D and Y are independent of X and that S is a diagonal ma- 
trix which allows efficient computation of the inverse and the determinant. 
After the eigenvalue decomposition of matrix R (complexity is C(n 3 )), for 
each X the computation of X requires 0(kn 2 ) operations where k < K is 
the number of columns of X that need to be recomputed, and for each set of 
values of the parameters the evaluation of the log-likelihood requires 0{nK) 
operations. To maximize the log- likelihood we apply a standard optimization 
technique of conditional maximization as described in the supplementary 
text. 

4.2. GLS approximation. In settings where the variance parameter r\ 
does not vary much between the analyzed X matrices, an efficient ap- 
proximation can be found by estimating r\ only once and then applying 
a generalized least squares (GLS) method to approximate the ML estimates 
of (3 and a 2 for any given X matrix while r] is kept fixed. This idea has 
been implemented in the software packages EMMAX (Kang et al., 2010) 
and TASSEL (Zhang et al., 2010); similar ideas had been proposed earlier 
by Aulchenko, de Koning and Haley (2007). We will call this approach the 
GLS approximation to the full model. 

The GLS approximation is accurate only if rj does not vary much between 
different sets of predictors, for example, when the individual predictors ex- 
plain only a negligible proportion of the total variance of the response. In 
current human GWAS this is typically the the still unidentified ge- 

netic effects are small. For example, in our MS study there were no noticeable 
differences between the full likelihood analysis and the GLS approximation 
and in their simulation study Zhang et al. (2010) did not find significant 
differences in the statistical power between the two methods. However, if 
the data contain closely related individuals and individual genetic effects 
explain enough phenotypic variation, then the full likelihood analysis may 
have higher power than the GLS approximation as we demonstrate below. 
With our efficient implementation of the full model it is possible to study 
this in more detail than before. 

Family Example. We consider children of 25 independent families each 
with 6 full-siblings and a quantitative phenotype of whose variance 15 % is 
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a 


MM 


GLS 


LM 


EMPIRICAL 


1(T 3 


0.914 
(0.913..0.915) 


0.910 
(0.909..0.911) 


0.890 
(0.889..0.891) 


1(T 4 


0.762 
(0.757..0.767) 


0.751 
(0.746..0.757) 


0.708 
(0.702..0.714) 


THEORETICAL 


1(T 3 


0.914 
(0.913..0.914) 


0.903 
(0.903..0.904) 


0.887 
(0.886..0.887) 


1CT 4 


0.760 
(0.759..0.761) 


0.732 
(0.731. .0.733) 


0.702 
(0.701. .0.703) 


1CT 6 


0.338 
(0.337..0.339) 


0.293 
(0.292. .0.294) 


0.265 
(0.264..0.266) 


5 x 10~ 8 


0.145 
(0.144. .0.145) 


0.115 
(0.114..0.115) 


0.099 
(0.098..0.099) 



Table 2 

Power in family data. Columns: a, type I error rate; MM, linear mixed model; GLS, 
generalized least squares approximation; LM, standard linear model. Cells give estimates 
of power together with their 95% confidence intervals. The first two rows are based on the 
empirical type I error thresholds and the remaining four rows use theoretical thresholds. 

explained by a major gene and 8.5% by minor genes (heritability is 23.5%). 
The remaining 76.5% of the variation in the phenotype is independent of 
the family structure. 

We simulated 10 million such phenotypes and paired each with a set 
of simulated genotypes that were independent of the phenotype (given the 
family structure) . The minor allele frequency was chosen uniformly between 
0.25 and 0.5 and Hardy- Weinberg equilibrium (see e.g. Lynch and Walsh 
(1998)) was assumed. We used these data sets to get accurate estimates of 
the threshold values of the likelihood ratio statistic under the null hypothesis 
of no genetic effect down to type I error 10 -4 . 

We then simulated an additional one million phenotypes but this time 
tested the genotypes of the major gene that influenced each phenotype. 
Using the empirical threshold values (with their 95% confidence intervals) 
from the null simulations the top two rows of Table 2 show the power of 
the linear mixed model (MM), the GLS approximation and the standard 
linear model (LM) at type I errors 10~ 3 and 10 -4 . In both cases MM is 
more powerful than the GLS approximation which in turn is more powerful 
than LM. 

In practice, inferences in GWAS are based on the asymptotic large-sample 
properties of the test statistics. As mentioned in Section 2, a widely- used 
method for checking how well the asymptotics hold is to assess the ratio of 
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the medians of the observed and expected (chi-square) test statistic distri- 
butions, denoted by A (Devlin, Roeder and Wasserman, 2001). For a sample 
of 10 7 draws from the theoretical null distribution the (analytically calcu- 
lated) upper bound of the 95% confidence interval of A is 1.0014 whereas 
in our 10 7 null simulations we observed values 1.055, 1.031 and 1.280 for 
MM, GLS approximation and LM, respectively. Even though accounting for 
families has brought MM and GLS much closer to the asymptotic null dis- 
tribution compared to LM, both methods are still inflated with respect to 
the theoretical distribution in this example with a fairly small sample size. 
Note that the GLS approximation always results in smaller likelihood ratio 
statistics, and thus smaller A values, than MM since GLS does not maximize 
the full model under the alternative whereas MM does. 

A simple way to make the observed test statistics match better with the 
theoretical distribution is to divide them by their corresponding estimates of 
A, a procedure called genomic control (GC) (Devlin, Roeder and Wasserman, 
2001). In this example GC works well but since it treats all the variants the 
same it is not an ideal method for controlling for confounding in more com- 
plex scenarios where different loci have very different population genetic 
histories (Astle and Balding, 2009). Therefore we have not used it with the 
MS data set. 

Table 3 shows the ratios of some quantiles of the observed distributions to 
their theoretical values after genomic control, together with the theoretical 
95% intervals of those ratios assuming 10 draws from the null distribution. 
We observe no deviation from the theoretical distribution for the linear 
mixed model and only a slight deflation for the standard linear model but the 
GLS approximation is deflated throughout the range of quantiles considered. 
Whether this phenomenon is specific to the family data considered or holds 
more generally requires further investigation. The lower panel of Table 2 
shows power at the theoretical thresholds corresponding to type I error rates 
relevant in GWAS, after genomic control was applied to the one million 
non-null tests. The relative power difference between MM and the GLS 
approximation increases with decreasing type I errors. 

In this example MM was noticeably more powerful than the GLS ap- 
proximation, both at the empirical and theoretical thresholds, after making 
the inflated statistics comparable by genomic control. On the other hand, if 
neither empirical thresholds nor genomic control parameters were available, 
then the GLS approximation could be a more robust choice in small data 
sets as reflected by the observed A values in this example. 
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a 


EXPECTED 


MM 


GLS 


LM 


10~ a 


0.997.. 1.003 


0.998 


0.979 


0.990 


1(T 4 


0.992.. 1.008 


0.997 


0.973 


0.992 


ltr 5 


0.981. .1.019 


1.001 


0.969 


1.002 



Table 3 

Ratios of observed quantiles to expected. Columns: a, upper quantile; EXPECTED, 
theoretical 95% confidence interval for the ratio in 10 7 samples; MM, linear mixed model; 
GLS, generalized least squares approximation; LM, standard linear model. Values outside 

the interval are in bold. 

4.3. Comparing methods. Our conditional maximization (CM) algorithm, 
EMMA and the GLS approximation all make use of a decomposition of the 
R matrix requiring C(n 3 ) operations. 

(i) . EMMA requires an additional C(n 3 ) matrix decomposition for each 

set of predictors X whereas CM and GLS are 0{n 2 ) algorithms for 
each X given the initial decomposition of R. 

(ii) . EMMA reduces the problem to one-dimensional optimization for which 

the global maximum is in theory more reliably found than by using 
CM. 

(iii) . Parameterization of the model through r/ (CM) has a computational 

advantage over parameterization using 5 = (1 —rj)/rj (EMMA) since 
the maximization is easier over the compact set r\ € [0, 1] than over 
the unbounded interval 5 € [0, oo). 

(iv) . It is expected that the GLS approximation is computationally more 

efficient but in some cases less accurate in ML estimation, and less pow- 
erful in testing the predictors than either EMMA or CM, as demon- 
strated with the previous family example. 

We investigated through simulation studies how the above differences 
manifest themselves in practice, related to the reliability and running time 
of the algorithms. We applied the EMMA R-package v. 1.1. 2 (Kang et al., 
2008) with the default parameters and our C-implementation of the CM 
algorithm (software package MMM). For the time comparisons we also in- 
cluded a GLS approximation (our C-implementation in software package 
MMM) and a beta version of the algorithm FMM 1 (Astle, 2009). We note 
that the software package TASSEL relies on the EMMA algorithm in full 
ML estimation, and for the GLS approximation both TASSEL and EMMAX 
are similar to our GLS implementation. Therefore TASSEL and EMMAX 
were not included in these comparisons. 



Downloaded in March 2011 from http://astle.net/wja/ 
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max log-likclihood 


a 2 


V 


max A 


0.0053 


3.0 ■ 10~ b 


1.2 • 10~ b 


range 


(-1618.332, -1085.829) 


(0.601, 1.444) 


(0.029, 0.976) 



Table 4 

Maximum absolute differences between EMMA and CM and the ranges of the estimated 
quantities over 19,000 simulated data sets with 0.05 < n < 0.95. 

4.3.1. Reliability. The purpose of these tests is to assess whether condi- 
tion (ii) above has any practical effect on the variance parameter estimation. 
For each value of r\ G {0, 0.05, 0.1, . . . , 0.95, 1} we generated 1,000 data sets 
for n = 500 subjects. A single data set consisted of an R matrix and a Y 
vector. To create R we simulated non-zero elements of an n x n lower tri- 
angular matrix L from the standard normal distribution and set R = LL T 
with the extra condition that if some of the eigenvalues of LL T were < 10~ 3 
they were set to 10 -3 to guarantee that R was numerically positive-definite. 
(The largest condition number of R matrices was 1.5 x 10 6 .) Y was then 
simulated according to the model Y = g + e, where g ~ AA(0, r/R) and 
e ~ A/"(0, (1 — r])I). The ML estimates of r/ and a 2 were obtained from 
EMMA and the CM algorithm. Since FMM does not output the value of 
the maximized log-likelihood we have not included it in this comparison. 
Also, for these datasets, the GLS approximation is the same as the full 
model since we use each R matrix only once. Thus, no separate results for 
GLS are reported. 

The results for 19,000 data sets simulated with 0.05 < r/ < 0.95 were the 
same between the methods for all practical purposes (Table 4). In addition 
to being similar up to 3 decimal places, the optimized log-likelihood values 
had no tendency of being higher with one algorithm than with the other 
(p = 0.46 in the two-sided binomial test). 

When T) was on the boundary {0, 1} the CM algorithm found points where 
the log-likelihood was at least 0.01 higher than that found by EMMA in 
1,503 cases out of the 2,000 data sets (maximum of these differences was 
1.06). This is due to property (iii) above which requires EMMA to constrain 
the search to a compact subset of its unbounded search space. The size 
of the search space is a parameter of EMMA (we used the default values 
of —10 < log [S) < 10) and by increasing this interval higher likelihood 
values could be found also by EMMA, but with higher computational cost. 
Alternatively, one could parameterize EMMA using r\ instead of 5 in which 
case EMMA and CM would be expected to give the same results also on the 
boundary r/ € {0, 1}. 

Thus, even if in theory the CM algorithm does not have guaranteed con- 
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n 



Fig 4. Relative running times for 100 data sets compared to GLS, on the log-10 scale, 
as a function of the sample size n. Methods are the R-package EMMA v. 1.1. 2, our C- 
implementations of conditional maximization ( CM) and generalized least squares ( GLS) 
and a C -implementation of FMM (downloaded in March 2011). The figures below the 
GLS-lme are the GLS times in seconds. Note that GLS is less accurate than the other 
three methods which have fairly similar accuracy to each other. 



vergence to the global optimum, in practice it has found the same maxima 
as EMMA in all 19,000 cases with rj G {0.05, 0.95}. Furthermore, in 
the great majority of the remaining boundary cases n € {0, 1} the CM algo- 
rithm has actually found a point with a higher likelihood value than EMMA. 
Since we generated the covariance matrices randomly without any particular 
structure these results suggest that the CM algorithm is a reliable method 
for the general problem of ML estimation in the linear mixed model that we 
consider. 

4.3.2. Running time. In applications, such as genome-wide association 
studies, where a single covariance matrix R is repetitively used with several 
sets of predictors X, there is a large difference in the running times between 
CM and EMMA due to property (i) above. To investigate this, for each 
n € {250, 500, . . . , 2000}, we simulated a single R matrix and Y vector as 
above, together with 100 different X matrices. Each X had dimension n x 2 
and the first column was always vector 1 to model the population mean and 
the second column contained a randomly sampled binary vector where each 
element was 1 with probability 0.5 and otherwise. The likelihood ratio 
(LR) tests for the effects /?2 were carried out using EMMA, FMM and our 
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EMMA 


CM 


FMM 


GLS 


EMMA 




3.2xl0" 4 


0.089 


0.43 


CM 


8.1xKT b 




0.089 


0.43 


FMM 


0.0045 


0.0046 




0.34 


GLS 


0.029 


0.029 


0.027 





Table 5 

Maximum absolute pair wise differences between EMMA, CM, FMM and GLS in 
likelihood ratio statistic (upper diagonal) and n estimate (lower diagonal) over the 800 

data sets of Figure 4- 

implementations of the CM and GLS algorithms. 

Figure 4 presents the running times as compared to the GLS approxima- 
tion. We see that independently of the sample size EMMA takes about 100 
times the time of the GLS procedure reflecting the fact that EMMA carries 
out an additional n x ra-matrix decomposition for each of the 100 data sets. 

The relative efficiency of the GLS procedure over CM decreases as the 
sample size grows. This is because both methods do the data initialization 
(computation of X = U T X) similarly and this task takes a larger and larger 
proportion of the whole running time as n grows. A similar trend of decreas- 
ing relative difference is also present when the initial matrix decomposition 
is subtracted from the running times of CM and GLS (results not shown). 

The FMM algorithm is clearly faster than EMMA but slower than CM 
except for the smallest sample size n = 250. We are not able to comment on 
the putative sources of these differences since the methodological details of 
FMM have not been published. 

Table 5 shows the maximum differences between the methods in the like- 
lihood ratio statistics and the estimates of rj. We see that the results from 
EMMA and CM were again practically the same over all 800 data sets and 
even though FMM deviated slightly from the common results of EMMA and 
CM it was clearly closer to those two methods than to GLS. 

Given these results it seems that CM is a natural choice for likelihood 
inference in the linear mixed model (4.1) since it is much faster than EMMA, 
more accurate than GLS, and still computationally feasible whenever GLS 
is. 

4.4. MS data set. We applied the CM algorithm to the non-UK compo- 
nent of our multiple sclerosis GWAS data set (20,119 individuals and 520,000 
SNPs). After the initial matrix decomposition was completed (in 3 hours 35 
minutes), the running time was 19 minutes 10 seconds per 1,000 SNPs us- 
ing a single processor (Intel Xeon 2.50 GHz) and about 3GB of RAM. This 
means that the whole MS data set can be run in 7 days and 2 hours by 
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using the CM algorithm on a single processor. If instead one were to apply 
a method such as EMMA, which requires a separate matrix decomposition 
at each SNP, we estimate that the corresponding running time of the whole 
MS data set would be about 210 years. As noted earlier, in GWAS where 
genetic effects are small, the GLS approximation (including programs EM- 
MAX and TASSEL) is expected to give, in practice, the same results as the 
full likelihood analysis, and thus could also have been a possible choice for 
this data set. The running time of the GLS approximation on the MS data 
set is about 5 days 19 hours, that is, 18% less than that of CM. 

5. Bayes factors. A Bayesian framework provides a fully probabilistic 
quantification of the association evidence, which is a useful complement to 
the traditional frequentist interpretation in the GWAS context (Wakefield, 
2009). It also allows use of prior knowledge, for example a particular de- 
pendency between the allele frequency and the effect size. This possibility 
becomes more and more important as our understanding about the genetic 
architecture of complex traits develops (Stephens and Balding, 2009). 

In a Bayesian version of the linear mixed model (1.1), in addition to the 
priors (1.2) for the random effects, we adopt the following priors 

(/3, a 2 ) ~ Normal-Inverse-Gamma(m, V, a, b), 
i] ~ Beta(r, t). 

Here a,b,r,t > are scalar parameters, m is a K dimensional vector and V 
is a K x K matrix. In the supplementary text we describe the properties of 
these priors and show how to efficiently evaluate the marginal likelihood of 
the data. The marginal likelihoods allow comparisons between models that 
differ in the structure of the predictor matrix X (e.g. testing genetic effects 
in GWAS), in the prior distributions of the parameters (e.g. whether 77 = 0), 
or both. 

5.1. Bayes factors for genetic association. In the non-UK component of 
our MS data set (20,119 individuals, 520,000 SNPs) the extra time spent in 
computing the Bayes factors for SNP effects compared to computing only 
the ML estimates was 2 minutes 13 seconds per 1,000 SNPs (Intel Xeon 
2.50 GHz), that is, an increase of about 12% in the running time. Following 
previous work (WTCCC, 2007) we chose the prior distribution on the genetic 
effect to be centered at and have standard deviation of 0.2 on the log-odds 
scale, independently of the allele frequency. With this choice there was nearly 
a linear relationship between the logarithmic p-values and the logarithmic 
Bayes factors (Figure 5). This data set-specific relationship provides useful 
information about these two conceptually different quantities. 
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A. 102 SNPs B. Model comparison for eta 




2 4 6 8 10 12 20 40 

-Iog10 p loglOBF 



Fig 5. A. Comparing -loglO p-values and loglO Bayes factors in the non-UK component 
of our MS study for 102 SNPs taken to replication. 

B. Distribution of loglO Bayes factors between models rj ~ Uniform(0, 1) and rj = 0. 
For each value of r\ £ {0, 0.1, . . . , 1}, 100 data vectors Y were simulated and means 
±2x standard deviations of the corresponding loglOBF distributions are shown. The pro- 
portions of the data sets for which logl0(BF)>0 were 0.01, 0.45 and 0.98, for true value 
of rj being 0.0, 0.1 and 0.2, respectively, and 1 whenever n > 0.3. 



5.2. Estimating heritability from a population sample. Recently, Yang et al. 
(2010) estimated the proportion of the variance in human height that can 
be explained by a dense genome- wide collection of SNPs from a large sample 
of distantly related individuals. Here we demonstrate Bayesian computation 
by answering a related question of how high heritability (i.e. rj in our mixed 
model) needs to be in order to be detectably non-zero from a particular sam- 
ple of distantly related individuals. Note, however, that we do not interpret 
77 as heritability in our MS data set due to the confounding effects of the 
population structure. 

We consider a sample of n = 5, 340 UK individuals including 2,665 healthy 
blood donors recruited from the United Kingdom Blood Service (UKBS) 
and 2,675 samples from the 1958 Birth Cohort (1958BC). These samples 
have been used as common controls for several GWAS carried out by the 
Wellcome Trust Case-Control Consortium 2. Here we focus on the genotype 
data generated by the Affymetrix 6.0 chip. After a quality control process, we 
made use of a genome-wide set of S = 168,351 approximately independent 
SNPs to compute a pair- wise genetic correlation matrix R = (r^) for these 
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individuals by setting 



(5.1) 



1 



E 



s 




2p s ) 



8=1 



2p s (l -p s ) 



where a s is the number of copies of allele 1 that individual i carries at SNP 

(i) 

s, ag is similarly defined for individual j, and p s is the frequency of allele 
1 at SNP s in the whole sample of n individuals. The interpretation of 
is that of relative genome-wide sharing of alleles compared to an average 
pair of individuals in the sample. In particular, negative (positive) rij de- 
notes more distant (closer) relatedness than that of an average pair in the 
sample, for whom the correlation is rj,- = 0. The same matrix (divided by 
2) is called "kinship matrix" by Astle and Balding (2009) and, excepting a 
slight adjustment on the diagonal, it is also same as the "raw" relatedness 
matrix used by Yang et al. (2010). For other versions of genetic relationship 
matrices, see for example Astle and Balding (2009); Kang et al. (2008). In 
our data all non-diagonal elements of R were below 0.03 showing that there 
were no close relatives within this sample. 

We simulated 100 phenotype vectors Y for the individuals for each value 
of r\ e {0, 0.1, . . . , 1} from the distribution Y ~ M(0, r}R+(l- rj)I). We 
then compared two versions, Mq and M\, of the linear mixed model 

Y = p + g + e, with g ~ M(0, t]a 2 R) and e ~ M(0, (1 - r])a 2 I), 

where in both models the prior on (/?,cr 2 ) was NIG(m = 0, V = 10, a = 
10, d = 12) and in Mq : r] = and in Mi : rj ~ Uniform(0, 1). For each data 
set we computed marginal likelihoods p(Y\M\) and p(Y\Mq) whose ratio 
gives the Bayes factor (BF), which tells how the prior odds of the models are 
updated to the posterior odds by the observed data Y (Kass and Raftery, 
1995). In particular, if BF>1 (i.e. logl0(BF)>0) then the data favors model 
Mi over model Mq, and if BF<1 (i.e. logl0(BF)<0) then the opposite is true. 
Figure 5 shows the distributions of loglO(BF) for different (true) values of 
r\. The running time of computing BFs for all 1,100 data sets was less than 
5 minutes (Intel Xeon 2.50 GHz) after R had been decomposed once, which 
took another 4 minutes. 

We conclude that in our data set the model Mq is (correctly) favored in 
almost all the cases that were simulated with 77 = and that when the true 
7] > 0.3 then it is very likely that model M\ will be favored. Roughly speaking 
this means that in these individuals we expect that this model comparison 
procedure detects non-zero heritability for the phenotypes that truly have 
heritabilities 7? > 0.3. However, with real data there is a complication that 
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the estimated R matrix does not completely capture the true genome- wide 
correlation as only a subset of the relevant variation is used in estimating R 
(Yang et al., 2010). As a consequence, with real phenotype data the lower 
limit of a convincingly detectable rj is likely to be higher than in these 
simulations which have assumed that R is known exactly. 

In general the distribution of BFs depends on the sample size n, the 
relatedness structure R and the priors on r), and the formulae we have 
derived in the supplementary text provide a computationally efficient way 
to assess these dependencies in any particular data set. 

6. Discussion. Motivated by genome-wide association studies (GWAS) 
we have presented computationally efficient ways to analyze the linear mixed 
model 

Y = X(3 + q + e, with 

Q\{rj,a 2 ) ~ M(0,rja 2 R) and e\(rj, a 2 ) ~ M(0, (1 - r])a 2 I) 

in situations where many X matrices are analyzed with a single covariance 
matrix R. In the GWAS context the role of the random effect q is to control 
for those associations between phenotypes and genetic variants that can al- 
ready be explained by the genome-wide genetic sharing. The mixed model 
approach is especially useful when the study individuals show complex relat- 
edness structure which cannot be captured by including a few linear predic- 
tors in the model. Such a situation may arise if a case-control study combines 
individuals from several populations with differing case-control ratios (e.g. 
IMSGC and WTCCC2 (2011)) or if the sampled individuals contain close 
relatives, e.g. in studies of model organisms (Atwell et al., 2010; Kang et al., 
2008; Yu et al., 2005), domesticated animals (Boyko et al., 2010) or humans 
with recent pedigree information (Aulchenko, de Koning and Haley, 2007). 

For our case-control GWAS application (IMSGC and WTCCC2, 2011) we 
have derived an accurate transformation between the linear and logistic re- 
gression models when the predictors have only small effects on the response. 
Crucially, this allows interpreting the output from the linear mixed model 
on the log-odds scale, which is important in the GWAS context both for 
understanding the sizes of the genetic effects and for combining the results 
via meta-analyses across independent studies. 

We have also formulated a conditional maximization (CM) algorithm 
for maximum likelihood estimation which is an order of magnitude faster 
than the existing EMMA algorithm (Kang et al., 2008) and in our tests was 
also faster than the FMM algorithm (Astle, 2009), except with the smallest 
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sample size (n = 250). With the small effect sizes that are typical in cur- 
rent GWAS the full mixed model analysis (performed by CM, EMMA and 
FMM) gives very similar results to the generalized least squares approxi- 
mation (GLS) that has been implemented in EMMAX (Kang et al., 2010) 
and TASSEL (Zhang et al., 2010). However, in other genetics contexts the 
full mixed model may be more powerful than the GLS approximation as we 
demonstrated with an example that contained close relatives and genetic 
variants with large effects. Given that our CM approach is computationally 
only slightly more demanding than the GLS approximation, (by about 20% 
in running time in our large MS data set), it seems well-suited for routine 
use in genetics applications. 

We also considered computation of Bayes factors for the genetic associa- 
tions as well as for the variance components. Another possible application 
of the Bayesian model is in predicting an unobserved response yi based on 
the set of observed values Y _j = (y±, . . . , yi-\, yi+i, • • • , y n ) T and the model 
M (which contains information on priors, X and R). The required posterior 
p(yi\Y_i, M) oc p(yi,Y _i\M) can be efficiently calculated on a grid of pos- 
sible values yi by using the methods described in the supplementary text. 
These calculations are especially simple if response yi is restricted to a set 
of discrete values as is the case with binary data. 

A natural question is whether the efficient computational solutions pre- 
sented in this article could be extended to linear mixed models with more 
random effects. This would be important for example in analyzing gene ex- 
pression data by including both the genetic relatedness and the expression 
heterogeneity as random effects (Listgarten et al., 2010). The key issue that 
made the CM algorithm fast in our application was the ability to diagonalize 
the full covariance matrix XI by using an orthonormal matrix U which did 
not depend on the variance parameters, or in other words, R and / were 
simultaneously diagonalizable by the same orthonormal U. More generally a 
set of symmetric matrices is simultaneously diagonalizable by an orthonor- 
mal matrix if and only if the matrices commute (Thm 4.18 in Schott (2005)). 
Thus the computational strategy that we used here generalizes straightfor- 
wardly only to a rather special case of commutable covariance matrices. 
In other situations an approximation to the full model could be achieved 
by the generalized least squares approximation where the variance param- 
eters are estimated only once and then kept fixed for the repeated analysis 
of different predictor sets (Kang et al., 2010; Zhang et al., 2010). On the 
other hand, an efficient generalization of both CM and EMMA to multiple 
response vectors Y is straightforward since the necessary matrix decompo- 
sitions do not depend on Y. This feature was utilized in our example of 
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heritability estimation. 

Extending linear mixed models to proper variable selection models that 
simultaneously analyze several thousands of predictors is also an important 
topic. Further work is required to determine whether the computational 
solutions presented in this work can help implement more complex variable 
selection models. 

Even though GWAS and other genetics applications have given the main 
motivation for this study, our results are more generally valid for any appli- 
cation that fits into the framework of the standard linear model with one 
additional normally distributed random effect. 
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SUPPLEMENTARY MATERIAL 

Supplementary text: 

(http://lib.stat. emu. edu/aoas/???/???). In this supplement we give the de- 
tails of the application of the mixed model to binary data, of the conditional 
maximization of the likelihood function and of the Bayesian computations. 

Software: MMM 

(http:/ /www. iki.fi/mpirinen). We have implemented the CM algorithm, the 
GLS approximation, the log-odds estimation procedure and the Bayes fac- 
tor computation in software package MMM. The C-source code is publicly 
available under the GNU General Public License. 
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In this supplement to "Efficient Computation with a Linear Mixed 
Model on Large-scale Data Sets with Applications to Genetic Stud- 
ies" we give the details of the application of the linear mixed model to 
^ , binary data, of the conditional maximization of the likelihood func- 

■ tion and of the Bayesian computations. 



PL, 
< 



Throughout this text we consider the linear mixed model 
(0.1) Y = X/3 + g + e, 



(— > 

where Y = (y±, . . . , y n ) 2 is the vector of responses on n subjects, X = (x^) 
is the n x K matrix of predictor values on the subjects, f3 = (/3i, . . . , (3k) T 
collects the (unknown) linear effects of the predictors on the responses Y 
' and random effects q and e are assigned distributions 

00 ' 

OO; (0.2) g| (77, a 2 ) ~ jV(0, rja'^R) and e|(r/, a 2 ) ~ M(0, (1 - r/)cr 2 /), 

■ where R is a known positive semi-definite n x n matrix, i" is the n X n 

identity matrix, and parameters a 2 > and 77 € [0, 1] determine how the 
variance is divided between g and e. 

1. Binary data. For 0-1 valued responses Y . . . , y n ) T , a logistic 

^ 1 regression model assumes that 

T3 ( hy- \ exp(a + Xi7) 
Pi = P{Vi = l|X,a,7) - 



1 + exp(a + Xi~y) 



where the row i of X is denoted by Xi, the effects of the predictors are 
in vector 7 and a is the population base-line effect. (Note that here the 
base-line effect has been explicitly separated from the X matrix.) The log- 
likelihood function for exchangeable observations is 

n 

Lb{a,l) = logP(V|a,7) = ^ {ydogpi + (1 - yi)log(l -pi)) , 

i=i 

1 
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where we use the subscript b to denote "binary" . 

First order approximation (FOA). By treating a and 7 as known, by 
mean-centring the predictors and by expanding pi as a Taylor series around 
the mean, Xi = 0, we have 



e a (l 

l + e a ' (l + e a ) 2 "' ' ' 2(1 + e a f 



(1-1) Pi - 7^ + 7^W^7+^- y(^7) 2 + 



When the effects are small on the log-odds scale in the sense that |.XVy| 
is small, then (Xi-f) 2 ~ and the probability pi is accurately approximated 
by a linear function of the predictors pi ~ fi + constrained to lie in 

[0,1]. According to (1.1), the parameters are transformed between logistic 
(0,7) and linear (n,P) scales as 



(1.2) 



a =log( T ^j, 7fc =^r^, for k = 1,...,K, 
V = TT^' = Ik TrSsV i for k = l,...,K. 



l + e a, — rk (l+ e a)^ 1 

The score and the Hessian of the logistic regression model are 

(1.3) 3 J± = X* ( Y-p) 

(1.4) = -X T diag(R(l- K ))X, 

where we have included the base-line parameter a in 7 and augmented X ac- 
cordingly with a column of ones, and p = (pi, . . . ,p n ) T is a function of 7. By 
using the small-effect approximation p ~ Xf3 as in the derivation of (1.2), 
but now with fx included in (3, we see that the score (1.3) is approximately 
zero at the least squares estimate f3 = (X T X)~ 1 X T Y . Thus, if the assump- 
tion of small effects is valid, an application of the least squares method (i.e. 
the maximum likelihood (ML) estimation in the standard linear model) to 
binary data and the transformation of the parameters to the log-odds scale 
using (1.2) should give a good approximation to the ML estimates of the 
logistic regression model. The sampling variance of the coefficients could be 
approximated by the inverse of the negative Hessian (1.4) at the estimated 
maximum or by transformation (1.6) explained below. Despite the simplic- 
ity of this linear approximation, we are not aware of its previous formal 
derivation, although similar ideas have been applied before, for example by 
Denby, Kafadar and Land (1998). From now on we call it the first order ap- 
proximation (FOA) to distinguish it from a more accurate approximation 
that we have established particularly for our genetics application. 
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Fig 1. Comparing the first order approximation (FOA) and the GWAS approximation. 
Panels include 4, 500 binary variants considered in Figure 2 of the main text. The leftmost 
column shows relative errors (in percentages) between the FOA and the maximum likeli- 
hood estimates from the logistic regression model as a function of the estimated log-odds 
ratios (top) and standard errors (bottom). The rightmost column shows similar results 
for the G WAS approximation. The middle column shows histograms of the differences be- 
tween absolute values of the relative errors (in percentages) from the FOA and the G WAS 
approximation, truncated from above at 5%. logOR, log-odds ratio; SE, standard error. 



GWAS approximation. We consider a GWAS setting in which the case- 
control status is regressed on the population mean and the reference allele 
count. By examining the second and third order terms of series (1.1) and 
carrying out some empirical testing we found that the relative differences 
between the log-odds estimates from the FOA and the ML estimates from 
the logistic regression model are accurately described by 

(1.5) r(7, 6, <j)) = 0.5(1 - 20) (1 - 20)7 - (0.084 + 0.90(1 - 20)61(1 - 0))f, 

where 6 is the frequency of the reference allele, 7 is the log-odds estimate for 
the reference allele from the FOA and <j) is the proportion of cases in the data. 
This can be used for adjusting both the estimates: 7 = 7/(1 + r(j, 6, (/))), 
and their standard errors. Figure 1 above shows the improvement of this 
GWAS approximation over the FOA. 

Mixed model. When we model the binary responses Y as correlated 
according to the variance structure a 2 ~S where the matrix X is known, 
an analogous estimate of the parameters is the generalized least squares 
(GLS) solution /3 = (X T £~ 1 X)- 1 X T £~ 1 Y' which can be transformed to 
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the log-odds scale using (1.2) (and possibly adjusting by the GWAS ap- 
proximation). In this case the sampling variance on the linear scale can 
be approximated using the GLS estimate V p = a 2 (X T Yl~ l X)~ 1 , where 
a 2 = -{Y — JC/3) T S _1 ("K — X(5). The corresponding estimates on the log- 
odds scale are then given by the delta-method: 



(1.6) = JV/3J T , where J VJ 




(3=13. 



We note that in the most general case of our linear mixed model (0.1) the 
covariance structure X! = 7]R + (1 — rj)I includes the parameter r/. If rj is 
estimated by maximum likelihood we cannot any more justify the method 
as a pure least squares method. In any case the empirical results in the main 
text suggest that the procedure works well in our application. 

The standard way of finding ML estimates for logistic regression is known 
as iteratively reweighted least squares (Nelder and Wedderburn, 1972). As 
an instance of the Newton-Raphson algorithm it is based on the second 
order Taylor series approximation of the log-likelihood and results in a se- 
ries of least squares problems where the outcome variable and the diagonal 
covariance matrix vary between each iteration. In contrast, our first order 
approximation is based on the linear approximation of the probabilities pi 
(not the log-likelihood) and is available after a single application of the least 
squares method to the original binary data, but with a downside that it is 
accurate only in the case of small effect sizes. 

Equivalence between the trend test and the linear model. Above we showed 
how the linear model can estimate the effects on the log-odds scale. Next 
we give another justification for the application of the linear model to case- 
control data by showing that for large sample sizes the likelihood ratio test 
for the SNP effect in the standard linear model is equivalent to the Ar- 
mitage trend test of the genotype counts (Armitage, 1955). The trend test 
is widely-used for analysing case-control GWAS and in this context is also 
equivalent to a score test of a logistic regression model. Previously, connec- 
tions between the trend test and the linear model in the GWAS context 
have been discussed by Kang et al. (2010); also Astle and Balding (2009) 
give conditions under which the linear model can be applied to case-control 
data. 

Suppose that we have genotype data on S cases and R controls with 
n = S + R, and denote the mean-centred genotype of individual i by Xj and 
the binary phenotype by yi £ {0, 1}. The trend test-statistic can be written 
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as T 2 /V where 

^ i&S U i&R in i&S 

1 n 

v = — Vi 2 

i=i 

and it has an asymptotic Xi _ distribution under the null hypothesis of no 
linear trend in the genotype frequencies between the cases and the controls 
(Astle and Balding, 2009). 

The maximum likelihood estimates for the linear models Mq : ~ 
A/"(/Uo,o"o) and M\ : yi ~ Af(fii + f3iXi,a\) are 

Ao = <j>, 
Ai = <P, 

1 " 



n . 
i=i 



n f^i "(Ei=i » 



where = S/n. The likelihood ratio statistic is 



Li(fii,pi,of, 
2 log — = nlo; 




Here the convergence is derived from a basic property of the exponential 
function: {l+a/n) n — > e a as n — > oo, for any real value a. Thus the likelihood 
ratio statistics approaches the trend test statistic as n — ) oo. 
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2. Likelihood analysis. The log-likelihood function for model (0.1) is 

L(f3, rj, a 2 ) = c - | log(a 2 ) - \ log(|E|) - ^ - X/3f ^(Y - X(3), 

where S = r]R + (1 — 77)/, c = — 8 log(27r) and |S| denotes the determinant 
of £. 

The eigenvalue decomposition of the positive semi-definite matrix R yields 
an orthonormal n x ?7,-matrix U of eigenvectors and a diagonal n x n- 
matrix D of non-negative eigenvalues for which R = UDU T (see e.g. 
Golub and Van Loan (1996)). Because UU T = I (orthonormality) it fol- 
lows that 

X = rjR+(l-r))I 

= rjUDU T + (1 -n)UIU T 
= U(r]D + (1 - n)I)U T , 
E" 1 = U(r]D + (1 - r / )/)~ 1 (7 T , 

n 

isi = n( i +^- 1 ))' 

where (ij is the ith eigenvalue of -R, that is, the element of D. The 
inverse X -1 is defined for all n € [0, 1] unless some di is zero in which case 
we restrict the model to the values r/ < 1. 

By transformations Y = U T Y , X = U T X , and S = rjD + (1 — rf)I the 
log-likelihood becomes 

L((3,r],a 2 ) 
= c--log(a 2 ) 

= c-^log(a 2 ) 

where \Y — X(3]i is the ith. element of vector Y — X(3. For each set of values 
of the parameters, the evaluation of the log-likelihood requires 0{nK) basic 
operations, where n is the number of individuals (rows of X matrix) and K 
is the number of predictors in the model (columns of X). 

2.1. Conditional maximization. To maximize the log-likelihood we use a 
standard optimization technique of conditional maximization. After initial- 
izing the parameters to values r](°\ (<7 2 )(°)), we iterate the following 



1 



1 



log(|£|)-^0r-X/3) T 5TV 



-$>g(l + r,(ck-l)) -J2 



i=l 



X(3) 

' 2<r 2 (l + 77(^-1)) 
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three step process until convergence: 

pQ+i) = argmaxL(/3,7/ J '),(a 2 )W) 
(a 2 )( j+1 *> = argmaxL(/3 (j+1) ,7? (j ' ) ,a 2 ) 

a 2 

v (j+i) = argmaxL(/3( J ' +1 \??,(a 2 )( J+1 )), 

where the superscripts in parentheses denote the iteration. We have not 
established theoretical conditions which would guarantee that the process 
finds the global maximum, but we know that conditional on rj we always find 
the global maximum with respect to (3 and a 2 . Furthermore, in the compar- 
isons with the EMMA algorithm we have not found a single data set where 
the algorithm would have failed (see the main text). If such exist in some 
applications, then one could run the algorithm several times starting from 
different initial values. Steps 1 and 2 are done analytically by using standard 
results on linear models, and step 3 is done by numerical maximization using 
some ideas from Kang et al. (2008). 
Step 1: The derivative 

is zero at f3 = (X J] X)~ 1 X S Y assuming that matrix X is of full col- 
umn rank. Under the same assumption the Hessian Cpr(/3) = —-^iX XI X 
is negative definite and thus the function j3 — > L((3,rj,a 2 ) attains its global 



maximum at (3. 

Step 2: The derivative 



dL(f3, V ,a 2 ) _ n J_"([Y-X0\i 



8{a 2 ) 2d 2 2c7 4 ^ (1 + 7,(^-1)) 

is zero at a 2 = — , where A = y]?-i rrr-^r^rri • The second derivative 

gCTp (°" 2 ) — < thus showing that the function a 2 L(j3,7],a 2 ) 

attains its global maximum at a 2 . (Actually, since the value f3 in step 1 
does not depend on a 2 , the steps 1 and 2 together give the global maximum 
of the function (/3, a 2 ) — s> L(f3, rj, a 2 ).) 

Step 3: We use a Newton- Raphson method to find zeros of the derivative 



dL{f3^a 2 ) 1 " di-l ( ([Y-Xp]tf 



dri ~ 2 §1 + 7/(^-1) 1^(1 + ^-1)) 



1 
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We divide the interval [0, 1] into m subintervals by points {0, — , . . . , ^E-, 1} 
and evaluate the derivative at each of these points. If the sign of the deriva- 
tive changes from positive to negative within an interval, we apply the 
Newton-Raphson algorithm to find a zero within that interval. Finally we 
choose the maximum of the log-likelihood values among the local maxima 
(zeros of the derivative) or the values at the endpoints. A problem would 
occur if there were several zeros within a single interval because this algo- 
rithm would find at most only one of them. To reduce chances of such an 
event one should in principle use a relatively large number of subintervals 
m. In our examples, we have used m = 10. 

2.2. The second derivatives. Asymptotic likelihood theory allows us to 
estimate the standard errors of the parameters by using the inverse of the 
observed information matrix I at the MLE. The elements of I are 



d 2 L 

dOidOj 



where (0\, . . . , 9k +2) = (Pi, ■ ■ ■ > Pk,Vi v 2 )- Straightforward calculations show 
that the second derivatives are 

dp 2 a 2 



d/3d(a 2 ) a 

d 2 L _ 1 " {di-l^Y-Xp^Xik 



dMv ° 2 f^i (1 + V(di ~ I)) 2 

d 2 L jn_ 1 " ([Y-XPjj) 2 

d(a 2 ) 2 2cj 4 a 8 ^ 1 + n{di - 1) 



d 2 L J_ " K-l)([r-X/3],) 2 

" 0^.4 



d(a 2 )drj 2a 4 ^ (1 + r/(d; - l)) 2 

cPL 1 " / dj-1 \ 2 ( 2([y-X/3] t ) 2 ' 
dr, 2 2^\l + V (di-l)J I ^(l + T/Cdi-l)) 



3. Bayesian computation. In a Bayesian version of the mixed model, 
we combine the sampling distribution 



Y\(P, a 2 , n) ~ M{XP, V a 2 R + (1 - r/)^ 2 /), 
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with the prior distributions 

(/3, cr 2 ) ~ Normal-Inverse-Gamma(m, V, a, b), 
i] ~ Beta(r, t), 

where a, b, r, t > are scalar parameters, m is a K-dimensional vector and V 
is a K x if-matrix. These choices of prior distributions lead to the following 
marginal properties: 

(3 ~ t 2a (m, 2bV) E(/3) = m Var(/3) = ^ V 

ff 2 ~Inv-Gamma(a,6) E(<7 2 ) = ^_ Var(cr 2 ) = {a _^ {a _ 2) 
i] ~ Beta(r, i) E(t?) = ^ Var (77) = (r+f) ^ +m) . 

An intuitive description of the Normal-Inverse-Gamma (NIG) distribution 
is that a pair (/3,cr 2 ) is generated from NIG(m, V, a, b) by first sampling 
cr 2 ~ Inverse-Gamma(a, b) and then (3\a 2 ~ J\f(m, a 2 V). 

The most notable restriction of these priors is that (3 and a 2 are a pri- 
ori dependent (see O'Hagan and Forster (2004)). To adjust the prior in a 
particular setting it is often helpful to standardize both the quantitative 
responses and each continuous predictor. The GWAS software SNPTEST2 
uses a similar prior distribution for analyzing quantitative traits with the 
standard linear model and some guidelines for prior specification can be 
found in its manual 1 . 

The steps to carry out analytic integration of /3 and a 2 in the mixed 
model considered here have the same form as the corresponding steps in 
the general linear model (O'Hagan and Forster, 2004). This is an advantage 
of our parameterization compared to a previous treatment of this mixed 
model by Sorensen and Gianola (2002); for details of the differences, see the 
discussion at the end of this section. Another novelty of our work is to show 
that the marginal likelihood computations can be done efficiently using the 
same matrix decomposition that was introduced for ML estimation in the 
previous section. This is crucial in order that a large number of X matrices 
can be analyzed efficiently in the Bayesian framework. To our knowledge, 
this topic has not previously been considered in the literature. 

3.1. Computation. As before, the likelihood part of the model is 
P (Y\f3, a 2 , V ) = (2tt)-5 |a 2 S|-i exp (Y - X(3) T ^\Y - Xf5)^j 

www. st at s . ox . ac . uk/ ~ marchini/ software / gwas /snptest 
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with S = rjR + (1 — rj)I and the prior density p((3, a 2 , rj) = p{rj)p((3, a 2 ) is 
composed of 

ua( 2\-(a+l+K/2) / i 

PiP** 2 ) = T^WI^TTnF^^V [-—, (((3 - m) T V~\(3 - m) + 2b 



(27r)^/ 2 |F| 1 /2r(a) F V 2a 2 

By direct calculation it can be verified that 

(Y - X/3) T S- X (Y - X(3) + ((3- m) T V~ l {f3 - m) 
= Y T T,~ 1 Y + m T V~ 1 m- (m*) T (V*y 1 (m*) + 
+ ((3-m*) T (V*y 1 (f3-m*), 

where 

V* = (V- 1 + X T 'E- 1 X)- 1 
m* = V*{V~ 1 m + X T 'E- 1 Y). 

With this notation the joint density P(Y,(3,a 2 ,r]) is 

p{Y\{3,a 2 ,rj)p{P,a 2 ,rj) 

= Pin) x ; n-\-K - - -i m 5 x 
(2tt)— T(a)\V\2 

x exp (-2^(03 - m*) T {V*r\(3 - m*) + b*] 

where b* = 2b + Y^E^Y + m T V 1 m - (m*) T (V*)' 1 (m*) . By noticing 
that as a function of (3 the above density is proportional to the density of 
J\f(m*, a 2 V*), we are able to integrate analytically with respect to (3: 



i 



(2^T(a)\V\ 1 2 \\V*\J V ' "V 2a 



As a function of a the above function is proportional to the density of 
Inv-Gamma (^a + ^-J allowing us to calculate 

b a T(a + ^) fb*Y {a+ ^ f \V*\ V 



(3 ' 1) " (y -" ) = (i^fF(#UJ [mm) p( * 

As a function of 77, the density (3.1) is proportional to the posterior of rj, and 
thus evaluating it at a grid over the interval [0, 1] allows us to do inference 
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on 77 and approximate the marginal likelihood of the data, P(Y), by inte- 
grating (3.1) numerically. A Bayes factor, that is, the ratio of the marginal 
likelihoods of two models, can thus be calculated between models that dif- 
fer in the structure of the predictor matrix X (e.g. testing genetic effects in 
GWAS), in the prior distributions of the parameters (e.g. whether rj = 0), or 
both. Next we show how to do these computations efficiently by exploiting 
the same eigenvalue decomposition of R and transformed variables X and 
Y that were introduced for maximum likelihood estimation in the previous 
section. 

Consider the density (3.1), that is, 



I 1 n \ / I V*| \ 5 

p(Y,rj) = c x (6*)~^ a+ 2^ (^T^rJ p(v)> where c 



(2fr) a r(a + §; 

1 



TT2T(a)\V\ 



_> 



is independent of rj. The goal is to integrate this over the interval r\ € [0, 1], 
for example, by evaluating it at a grid of m equally spaced points in [0, 1] 
and by using the trapezoidal rule. 

First we notice that p(rj) and |S| do not depend on X and thus we evaluate 
them once at the given grid points and store the results for repeated use with 
different X matrices. A similar idea is applied to the first three terms of the 
quantity 

b* = 2b + Y T JT X Y + m T V~ l m - (m*) T (V*)" 1 (m*). 
The only X dependent quantities are thus 

V* = (V- 1 + X T 5T 1 Xr 1 and 



m 



V*(V~ l m + X T T l ~ 1 Y). 



Suppose that the analyzed X matrices differ from each other only in one 
predictor which is stored in the last column K of X. Then only the element 
K of the vector X T Yl~ 1 Y needs to be recomputed: 

(X T ) K .VY = {X T ) K .ufT X U T Y = (X^k.V-'y = ± * iK *' = , 

^ rj{di - 1) + 1 

where Y = U T Y,X = U T X, and S = r]D + (1 — rf)I as in the previous 
section and {X t )k» denotes the row K of matrix X T . Similarly we can 
recompute the elements 

(X^^X)^ = T lK ij , for j = 1, ... , K. 
^ ' 3 f^v(di -1) + 1' J 
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Since X and Y have already been computed for the ML estimation by 
the conditional maximization algorithm, the additional complexity of these 
Bayesian computations is 0(mKn) operations. (Here we assume that K <C 
n so that the complexity of the matrix operations on K x K matrices is 
negligible compared to the operations involving all n individuals.) 

An existing Bayesian treatment of the linear mixed model considered uses 
parameterization with two variance components a 2 and a 2 which are related 
to our parameters as of = (1 — r\)a 2 and a g = rja 2 (Sorensen and Gianola, 
2002). When independent Inverse-Gamma priors are assigned to a 2 and a 2 
it seems that it is not possible to analytically derive their one-dimensional 
marginal distributions (p. 323 Sorensen and Gianola (2002)). Thus, it seems 
that our parameterization has an advantage in computing marginal likeli- 
hoods since we only need to integrate numerically over the one-dimensional 
compact set rj € [0, 1] as opposed to the unbounded two-dimensional set 
(<T 2 ,a 2 g ) e (0,oo) x (0,oo). 
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